Supplementary Material for Self-Supervised Visual Representation Learning with Semantic Grouping
Xin Wen
There are two operations in our data augmentation pipeline that change the scale or layout of the image, i.e., random resized crop and random horizontal flip. This is followed by a resize operation (e.g., RoIAlign) that recovers the intersecting part to the original size and spatial layout. The total stride is 16 (FCN-16s [20]). Intuitively, each prototype can be viewed as the cluster center of a semantic class. During inference, we only take the teacher model parameterized by ξ.
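The two pieces described above can be sketched in plain Python: recovering the intersecting region of two random resized crops, and the exponential-moving-average update that produces the teacher parameters ξ used at inference. Function names and the momentum value are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed names) of two pieces of the pipeline:
# (1) the intersecting region of two random resized crops, and
# (2) the EMA update for the teacher parameters xi.

def crop_intersection(box_a, box_b):
    """Boxes are (x0, y0, x1, y1) in the original image's coordinates.
    Returns the overlapping region, or None if the crops do not overlap."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)

def ema_update(teacher_params, student_params, momentum=0.99):
    """Teacher parameters xi track the student by exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Example: two crops of a 224x224 image
overlap = crop_intersection((0, 0, 160, 160), (64, 64, 224, 224))
print(overlap)  # (64, 64, 160, 160)
```

The recovered overlap is what the resize/RoIAlign step maps back to a common spatial layout so that per-location features of the two views can be compared.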
ETAB: A Benchmark Suite for Visual Representation Learning in Echocardiography
Echocardiography is one of the most commonly used diagnostic imaging modalities in cardiology. Application of deep learning models to echocardiograms can enable automated identification of cardiac structures, estimation of cardiac function, and prediction of clinical outcomes. However, a major hindrance to realizing the full potential of deep learning is the lack of large-scale, fully curated and annotated data sets required for supervised training. High-quality pre-trained representations that can transfer useful visual features of echocardiograms to downstream tasks can help adapt deep learning models to new setups using fewer examples. In this paper, we design a suite of benchmarks that can be used to pre-train and evaluate echocardiographic representations with respect to various clinically-relevant tasks using publicly accessible data sets. In addition, we develop a unified evaluation protocol---which we call the echocardiographic task adaptation benchmark (ETAB)---that measures how well a visual representation of echocardiograms generalizes to common downstream tasks of interest. We use our benchmarking framework to evaluate state-of-the-art vision modeling pipelines. We envision that our standardized, publicly accessible benchmarks would encourage future research and expedite progress in applying deep learning to high-impact problems in cardiovascular medicine.
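The core of a task-adaptation protocol of this kind is a linear probe: freeze the pre-trained encoder, extract features, and fit a small head on the downstream labels. The sketch below is a generic stand-in (random-projection "encoder", least-squares head), not the ETAB API; all names and shapes are illustrative.

```python
import numpy as np

# Hypothetical sketch of a linear-probe step in a task-adaptation benchmark:
# freeze the pre-trained encoder, extract features, and fit a linear head
# on the downstream labels.

rng = np.random.default_rng(0)

def extract_features(encoder, images):
    """Apply a frozen encoder; here a fixed random projection as a stand-in."""
    return images @ encoder

def fit_linear_probe(features, targets):
    """Fit a linear head by ordinary least squares (no ridge, for brevity)."""
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w

encoder = rng.standard_normal((64, 16))   # frozen "pre-trained" weights
images = rng.standard_normal((100, 64))   # downstream training set
targets = rng.standard_normal((100, 1))   # e.g. a scalar clinical outcome

feats = extract_features(encoder, images)
w = fit_linear_probe(feats, targets)
preds = feats @ w
print(preds.shape)  # (100, 1)
```

Downstream performance of the probe, averaged across tasks, is then a proxy for how transferable the frozen representation is.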
Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inefficacy of Exponential Moving Average (EMA) from I-JEPA in preventing entire collapse and the inadequacy of I-JEPA prediction in accurately learning the mean of patch representations. To address these limitations, C-JEPA integrates JEPA with a variance-invariance-covariance regularization (VICReg) strategy. This integration is designed to effectively learn the variance/covariance for preventing entire collapse and ensuring invariance in the mean of augmented views, thereby overcoming the identified limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.
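The variance and covariance terms mentioned above can be sketched as VICReg-style regularizers: a hinge on the per-dimension standard deviation prevents collapse to a constant embedding, and an off-diagonal covariance penalty decorrelates dimensions. This is a generic illustration of the regularizer family, not C-JEPA's exact losses or weights.

```python
import numpy as np

# Hedged sketch of variance/covariance regularizers of the kind C-JEPA is
# described as integrating (VICReg-style terms).

def variance_loss(z, eps=1e-4):
    """Hinge on the per-dimension std so embeddings do not collapse."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, 1.0 - std))

def covariance_loss(z):
    """Penalize off-diagonal covariance so dimensions stay decorrelated."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return (off_diag ** 2).sum() / d

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 32))   # a batch of embeddings
print(variance_loss(z), covariance_loss(z))
```

On a fully collapsed batch (all rows identical) the variance term saturates near 1, which is exactly the failure mode EMA alone is said not to prevent.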
Visual Representation Learning with Stochastic Frame Prediction
Huiwon Jang, Dongyoung Kim, Junsu Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
Self-supervised learning of image representations by predicting future frames is a promising direction but still remains a challenge. This is because of the under-determined nature of frame prediction; multiple potential futures can arise from a single current frame. To tackle this challenge, in this paper, we revisit the idea of stochastic video generation that learns to capture uncertainty in frame prediction and explore its effectiveness for representation learning. Specifically, we design a framework that trains a stochastic frame prediction model to learn temporal information between frames. Moreover, to learn dense information within each frame, we introduce an auxiliary masked image modeling objective along with a shared decoder architecture. We find this architecture allows for combining both objectives in a synergistic and compute-efficient manner. We demonstrate the effectiveness of our framework on a variety of tasks from video label propagation and vision-based robot learning domains, such as video segmentation, pose tracking, vision-based robotic locomotion, and manipulation tasks. Code is available on the project webpage: https://sites.google.com/view/2024rsp.
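The stochastic-prediction objective can be illustrated with a toy latent-variable model: a posterior encodes the (current, next) frame pair into a latent z, a predictor reconstructs the next frame from the current frame and z, and a KL term keeps the posterior close to a prior conditioned on the current frame. All networks below are stand-in linear maps and the names are assumptions; the actual framework uses deep encoders plus a shared decoder with an auxiliary masked-image-modeling loss.

```python
import numpy as np

# Toy sketch of a stochastic frame-prediction objective:
# reconstruction of the next frame plus a KL between a frame-pair
# posterior and a current-frame prior over the latent z.

rng = np.random.default_rng(0)
D, Z = 8, 2                      # frame dim, latent dim

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

W_post = rng.standard_normal((2 * D, Z)) * 0.1   # posterior from both frames
W_prior = rng.standard_normal((D, Z)) * 0.1      # prior from current frame
W_dec = rng.standard_normal((D + Z, D)) * 0.1    # decoder to next frame

x_t = rng.standard_normal(D)
x_next = rng.standard_normal(D)

mu_q = np.concatenate([x_t, x_next]) @ W_post
mu_p = x_t @ W_prior
z = mu_q + rng.standard_normal(Z)                # reparameterized sample
x_hat = np.concatenate([x_t, z]) @ W_dec

recon = np.mean((x_hat - x_next) ** 2)
kl = gaussian_kl(mu_q, np.zeros(Z), mu_p, np.zeros(Z))
loss = recon + 1e-3 * kl
print(recon, kl, loss)
```

Sampling different z values yields different predicted futures, which is how the model captures the under-determined nature of frame prediction.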
SUVR: A Search-based Approach to Unsupervised Visual Representation Learning
Yi-Zhan Xu, Chih-Yao Chen, Cheng-Te Li
Unsupervised learning has grown in popularity because of the difficulty of collecting annotated data and the development of modern frameworks that allow us to learn from unlabeled data. Existing studies, however, either disregard variations at different levels of similarity or only consider negative samples from one batch. We argue that image pairs should have varying degrees of similarity, and the negative samples should be allowed to be drawn from the entire dataset. In this work, we propose Search-based Unsupervised Visual Representation Learning (SUVR) to learn better image representations in an unsupervised manner. We first construct a graph from the image dataset by the similarity between images, and adopt the concept of graph traversal to explore positive samples. In the meantime, we make sure that negative samples can be drawn from the full dataset. Quantitative experiments on five benchmark image classification datasets demonstrate that SUVR can significantly outperform strong competing methods on unsupervised embedding learning. Qualitative experiments also show that SUVR can produce better representations in which similar images are clustered closer together than unrelated images in the latent space.
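The graph-based sampling idea can be sketched concretely: build a k-NN graph from pairwise embedding similarities, then collect positives by traversing the graph outward from an anchor, while negatives may be drawn from anywhere in the dataset. The similarity measure, neighbor count, and traversal depth here are assumptions, not SUVR's exact recipe.

```python
import numpy as np
from collections import deque

# Illustrative sketch: k-NN similarity graph + breadth-first traversal
# to gather positive samples for an anchor image.

def build_knn_graph(embeddings, k=2):
    """Adjacency list: each node connects to its k most similar others."""
    sim = embeddings @ embeddings.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    return {i: list(np.argsort(-sim[i])[:k]) for i in range(len(embeddings))}

def positives_by_traversal(graph, anchor, max_hops=2):
    """BFS from the anchor; nodes reached within max_hops are positives."""
    seen, frontier = {anchor}, deque([(anchor, 0)])
    positives = []
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                positives.append(nb)
                frontier.append((nb, hops + 1))
    return positives

rng = np.random.default_rng(0)
emb = rng.standard_normal((6, 4))           # toy image embeddings
graph = build_knn_graph(emb, k=2)
print(positives_by_traversal(graph, anchor=0))
```

Traversal depth gives the "varying degrees of similarity" the abstract argues for: one-hop neighbors are strong positives, multi-hop neighbors weaker ones, and everything outside the reachable set is fair game as a negative.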